Document clustering using the LSI subspace signature model
Abstract
We describe the Latent Semantic Indexing Subspace Signature Model (LSISSM) for semantic content representation of unstructured text. Grounded in Singular Value Decomposition (SVD), the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between the LSI term subspace and the LSI document subspace. LSISSM performs feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and Self-Organizing Maps compared with the Vector Space Model (VSM) and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with the random seeding procedure, which sometimes causes low clustering efficiency and effectiveness. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
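The abstract does not give the model's formulas, so the following is only a rough sketch of the general idea, not the authors' LSISSM. It assumes that a "signature" can be approximated as the normalized magnitude of each term's or document's coordinates across the top-k SVD dimensions, and that K-means seeds can be picked as the documents contributing most to each dimension. The function names `lsi_signatures` and `seed_by_dominant_dimension`, the toy matrix, and both of these definitions are hypothetical.

```python
import numpy as np


def lsi_signatures(term_doc, k):
    """Return (term_sig, doc_sig) for a terms x documents matrix.

    ASSUMPTION: each signature row is the normalized absolute LSI
    coordinate across the top-k latent dimensions. This is an
    illustrative stand-in, not the definition used in the paper.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values descending.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)

    # Standard LSI coordinates, scaled by the singular values.
    term_coords = U[:, :k] * s[:k]       # shape (n_terms, k)
    doc_coords = Vt[:k, :].T * s[:k]     # shape (n_docs, k)

    def normalize(coords):
        mag = np.abs(coords)
        total = mag.sum(axis=1, keepdims=True)
        total = np.where(total == 0, 1.0, total)  # guard all-zero rows
        return mag / total

    return normalize(term_coords), normalize(doc_coords)


def seed_by_dominant_dimension(doc_sig):
    """Hypothetical seeding rule: for each latent dimension, pick the
    index of the document whose signature weight on it is largest."""
    return np.argmax(doc_sig, axis=0)


# Toy term-document matrix (6 terms x 4 documents) with two topics:
# terms 0-2 occur only in documents 0-1, terms 3-5 only in documents 2-3.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [0, 0, 1, 1],
], dtype=float)

term_sig, doc_sig = lsi_signatures(term_doc, k=2)
seeds = seed_by_dominant_dimension(doc_sig)

print("document signatures:\n", np.round(doc_sig, 3))
print("seed document per latent dimension:", seeds)
```

In this sketch, the seed documents could be passed as initial centroids to any K-means implementation in place of random seeding; whether that matches the paper's two-stage initialization strategy is not something the abstract confirms.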
Similar resources
Clustering and Active Learning Using a LSI Subspace
Chapter 1: Introduction. 1.1 Latent Semantic Indexing. 1.2 Visual Exploration of the LSI Subspaces...
A Systematic Study on Document Representation and Dimensionality Reduction for Text Clustering
Increasingly large text datasets and the high dimensionality associated with natural language are a great challenge for text mining. In this research, a systematic study is conducted of the application of three Dimension Reduction Techniques (DRT) to three different document representation methods in the context of the text clustering problem, using several standard benchmark datasets. The dimensional...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Comparing Dimension Reduction Techniques for Document Clustering
In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods, Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP), ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of t...
A Mutual Subspace Clustering Algorithm for High Dimensional Datasets
Generation of consistent clusters is always an interesting research issue in the field of knowledge and data engineering. In real applications, different similarity measures and different clustering techniques may be adopted in different clustering spaces. In such a case, it is very difficult or even impossible to define an appropriate similarity measure and clustering criteria in the union spa...
Journal title:
- JASIST
Volume 64
Publication date: 2013